Order Preserving Matching 



Jinil Kim, Peter Eades, Rudolf Fleischer, Seok-Hee Hong, Costas S. Iliopoulos,
Kunsoo Park, Simon J. Puglisi, Takeshi Tokuyama

Department of Computer Science and Engineering, Seoul National University, South Korea.
School of Information Technologies, University of Sydney, Australia.
SCS and IIPL, Fudan University, Shanghai, China.
Department of Applied Information Technology, German University of Technology in Oman, Muscat, Oman.
Department of Informatics, King's College London, London, United Kingdom.
Digital Ecosystems and Business Intelligence Institute, Curtin University, Australia.
Graduate School of Information Sciences, Tohoku University, Japan.


Abstract 

We introduce a new string matching problem called order-preserving matching on numeric strings,
where a pattern matches a text if the text contains a substring whose relative orders coincide with
those of the pattern. Order-preserving matching is applicable to many scenarios such as stock price
analysis and musical melody matching in which the order relations should be matched instead of
the strings themselves. Solving order-preserving matching has to do with representations of order
relations of a numeric string. We define prefix representation and nearest neighbor representation,
which lead to efficient algorithms for order-preserving matching. We present efficient algorithms
for single and multiple pattern cases. For the single pattern case, we give an O(n log m) time
algorithm and optimize it further to obtain O(n + m log m) time. For the multiple pattern case,
we give an O(n log m) time algorithm.

Keywords: string matching, numeric string, order relation, multiple pattern matching, KMP
algorithm, Aho-Corasick algorithm

1. Introduction

String matching is one of the fundamental problems that have been extensively studied in
stringology. Sometimes a string consists of numeric values instead of characters in an alphabet, and
we are interested in some trends in the text rather than specific patterns. For example, in a stock
market, analysts may wonder whether there is a period when the share price of a company dropped
consecutively for 10 days and then went up for the next 5 days. In such cases, the changing patterns
of share prices are more meaningful than the absolute prices themselves. Another example can be
found in the melody matching of two musical scores. A musician may be interested in whether her
new song has a melody similar to well-known songs. As many variations are possible in a melody
where the relative heights of pitches are preserved but the absolute pitches can be changed, it
would be reasonable to match relative pitches instead of absolute pitches to find similar phrases.

An order-preserving matching can be helpful in both examples where a pattern is matched with
the text if the text contains a substring whose relative orders coincide with those of the pattern.
For example, in Fig. 1, pattern P = (33, 42, 73, 57, 63, 87, 95, 79) is matched with text T since the
substring (21, 24, 50, 29, 36, 73, 85, 63) in the text has the same relative orders as the pattern. In
both strings, the first characters 33 and 21 are the smallest, the second characters 42 and 24 are
the second smallest, the third characters 73 and 50 are the 5-th smallest, and so on. If we regard
prices of shares or absolute pitches of musical notes as numeric characters of the strings, both
examples above can be modeled as order-preserving matching.

Figure 1: Example of pattern and text

Solving order-preserving matching has to do with representations of order relations of a numeric
string. If we replace each character in a numeric string by its rank in the string, then we can obtain
a (natural) representation of order relations. But this natural representation is not amenable to
developing efficient algorithms because the rank of a character depends on the substring in which
the rank is computed. Hence, we define a prefix representation of order relations which leads to
an O(n log m) time algorithm for order-preserving matching where n and m are the lengths of the
text and the pattern, respectively. Surprisingly, however, there is an even better representation,
called nearest neighbor representation, by which we were able to develop an O(m log m + n) time
algorithm.

In this paper, we define a new class of string matching problem, called order-preserving
matching, and present efficient algorithms for single and multiple pattern cases. For the single
pattern case, we propose an O(n log m) algorithm based on the Knuth-Morris-Pratt (KMP)
algorithm [14] [16], and optimize it further to obtain O(n + m log m) time. For the multiple pattern
case, we present an O(n log m) algorithm based on the Aho-Corasick algorithm [1].

Related Work: (δ, γ)-matching has been studied to search for similar patterns of numeric
strings jB] US] [TU [TTl [121 IMl HI]. In this paradigm, two parameters δ and γ are given, and two
numeric strings of the same length are matched if the maximum difference of the corresponding
characters is at most δ and the total sum of differences is at most γ. Several variants were
studied to allow for don't care symbols fT^, transposition-invariant jT^ and gaps [5J [101 HZ].
On the other hand, some generalized matching problems such as parameterized matching [Bill],
overlap matching [3], and function matching [51 [S] were studied in which matching relations are
defined differently so that some properties of two strings are matched instead of exact matching
of characters [22]. However, none of them addresses the order relations which we focus on in this
paper.



2. Problem formulation 

2.1. Notations 

Let Σ denote the set of numbers such that a comparison of two numbers can be done in
constant time, and let Σ* denote the set of strings over the alphabet Σ. Let |x| denote the length
of a string x. A string x is described by either a concatenation of characters x[1] · x[2] · ... · x[|x|] or
a sequence of characters as (x[1], x[2], ..., x[|x|]) interchangeably. For a string x, let a substring x[i..j]
be (x[i], x[i + 1], ..., x[j]) and the prefix x_i be x[1..i]. The rank of a character c in string x is defined
as rank_x(c) = 1 + |{i : x[i] < c for 1 ≤ i ≤ |x|}|. For simplicity, we assume that all the numbers
in a string are distinct. When a number occurs more than once in a string, we can extend our
character definition to a pair of character and index in the string so that the characters in the
string become distinct.

2.2. Natural representation of order relations 

For a pattern P[1..m] ∈ Σ* and a text T[1..n] ∈ Σ*, a natural representation σ(x) of the order
relations of a string x can be defined as σ(x) = rank_x(x[1]) · rank_x(x[2]) · ... · rank_x(x[|x|]).

Definition 2.1 (Order-preserving matching). Given a text T[1..n] and a pattern P[1..m], T
is matched with P at position i if σ(T[i - m + 1..i]) = σ(P). Order-preserving matching is the
problem of finding all positions of T matched with P.

For example, let's consider two strings P = (33, 42, 73, 57, 63, 87, 95, 79) and T = (11, 15, 33, 21,
24, 50, 29, 36, 73, 85, 63, 69, 78, 88, 44, 62) shown in Fig. 1. The natural representation of P is σ(P) =
(1, 2, 5, 3, 4, 7, 8, 6), which is matched with T[4..11] = (21, 24, 50, 29, 36, 73, 85, 63) at position 11
(the end of the occurrence, following Definition 2.1) but is not matched at the other positions of T.

As the rank of a character depends on the substring in which the rank is calculated, the string
matching algorithms with O(n + m) time complexity such as KMP and Boyer-Moore [14] [IB] cannot
be applied directly. For example, the rank of T[4] is 3 in T[1..8] but is changed to 1 in T[4..11].

The naive pattern matching algorithm is applicable to order-preserving matching if both the
pattern and the text are converted to natural representations. If we use the order-statistic tree
based on the red-black tree jT3], computing the rank of a character in the string x takes O(log |x|),
which makes the computation time of the natural representation σ(x) be O(|x| log |x|). The naive
order-preserving matching algorithm computes σ(P) in O(m log m) time and σ(T[i..i + m - 1]) for
each position i ∈ [1..n - m + 1] of text T in O(m log m) time, and compares them in O(m) time. As
n - m + 1 positions are considered, the total time complexity becomes O((n - m + 1) · (m log m)) =
O(nm log m). As this time complexity is much worse than O(m + n) which we can obtain from
the exact pattern matching, sophisticated matching techniques need to be considered for
order-preserving matching as discussed in later sections.
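
The naive procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names natural_rep and naive_order_matching are ours, ranks are obtained by sorting each window instead of using an order-statistic tree, all values are assumed distinct, and positions are reported as 1-based end positions in the sense of Definition 2.1.

```python
def natural_rep(x):
    """Natural representation: 1-based rank of each element within x (distinct values assumed)."""
    rank = {v: r for r, v in enumerate(sorted(x), start=1)}
    return [rank[v] for v in x]


def naive_order_matching(T, P):
    """Return all 1-based end positions i with sigma(T[i-m+1..i]) == sigma(P)."""
    n, m = len(T), len(P)
    sigma_p = natural_rep(P)
    # s is the 0-based start of the window; the window ends at 1-based position s + m.
    return [s + m for s in range(n - m + 1) if natural_rep(T[s:s + m]) == sigma_p]


T = [11, 15, 33, 21, 24, 50, 29, 36, 73, 85, 63, 69, 78, 88, 44, 62]
P = [33, 42, 73, 57, 63, 87, 95, 79]
print(natural_rep(P))              # [1, 2, 5, 3, 4, 7, 8, 6]
print(naive_order_matching(T, P))  # [11], i.e. the occurrence T[4..11]
```

Each window is re-ranked from scratch, which is exactly the O(nm log m) behaviour that the later sections avoid.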

3. O(n log m) algorithm

3.1. Prefix representation 

An alternative way of representing order relations is to use the rank of each character in the
prefix. Formally, the prefix representation of order relations can be defined as μ(x) = rank_{x_1}(x[1]) ·
rank_{x_2}(x[2]) · ... · rank_{x_|x|}(x[|x|]). For example, the prefix representation of P in Fig. 1 is μ(P) =
(1, 2, 3, 3, 4, 6, 7, 6).

An advantage of the prefix representation is that μ(x)[i] can be computed without looking at
characters in x[i + 1..|x|] ahead of position i. By using the order-statistic tree 𝒯 for dynamic
order statistics [14] containing characters of x[1..i - 1], μ(x)[i] can be computed in O(log |x|) time.
Moreover, the prefix representation can be updated incrementally by inserting the next character
to 𝒯 or deleting the previous character from 𝒯. Specifically, when 𝒯 contains the characters in
x[1..i], μ(x[1..i + 1])[i + 1] can be computed if x[i + 1] is inserted to 𝒯, and μ(x[2..i])[i - 1] can be
computed if x[1] is deleted from 𝒯.

Note that there is a one-to-one mapping between the natural representation and the prefix
representation. The number of all the distinct natural representations for a string of length
n is n!, which corresponds to the number of permutations, and the number of all the distinct
prefix representations is n! too since there are i possible values for the i-th character of a prefix
representation, which results in 1 · 2 · ... · n = n! cases. For any natural representation of a string,
there is a conversion function which returns the corresponding prefix representation and vice versa.

Function                       Description
OS-Insert(𝒯, x, i)             Insert x[i] to 𝒯
OS-Delete(𝒯, x)                Delete all the characters of string x from 𝒯
OS-Rank(𝒯, c)                  Compute the rank r of character c in 𝒯
OS-Find-Prev-Index(𝒯, c)       Find the index i of the largest character less than c
OS-Find-Next-Index(𝒯, c)       Find the index i of the smallest character greater than c

Figure 2: List of functions on 𝒯 for dynamic order statistics

Figure 3: Example of text search

The prefix representation of P is easily computable by inserting each character P[k] to 𝒯
consecutively as in Compute-Prefix-Rep. The functions of the order-statistic tree are listed
in Fig. 2. We assume that the index i of x is stored with x[i] in OS-Insert(𝒯, x, i) to support
OS-Find-Prev-Index(𝒯, c) and OS-Find-Next-Index(𝒯, c), where the index i of the largest
(smallest) character less than (greater than) c is retrieved.



Compute-Prefix-Rep(P)
    m ← |P|
    𝒯 ← ∅
    OS-Insert(𝒯, P, 1)
    μ(P)[1] ← 1
    for k ← 2 to m
        OS-Insert(𝒯, P, k)
        μ(P)[k] ← OS-Rank(𝒯, P[k])
    return μ(P)

The time complexity of Compute-Prefix-Rep is O(m log m) as each of OS-Insert and
OS-Rank takes O(log m) time and there are O(m) such operations.
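
As a rough illustration of Compute-Prefix-Rep, the sketch below replaces the order-statistic tree with a sorted Python list maintained through the standard bisect module. This is an assumption made for brevity: insertion into a list costs O(m) in the worst case, so the sketch does not reach the O(m log m) bound of a balanced tree. The function name prefix_rep is ours, and distinct values are assumed. The last lines check the one-to-one correspondence with permutations for length 4.

```python
from bisect import bisect_left, insort
from itertools import permutations


def prefix_rep(x):
    """Prefix representation: mu(x)[i] is the rank of x[i] among x[1..i]."""
    seen = []  # sorted list standing in for the order-statistic tree
    mu = []
    for c in x:
        mu.append(bisect_left(seen, c) + 1)  # number of smaller characters so far, plus 1
        insort(seen, c)
    return mu


P = [33, 42, 73, 57, 63, 87, 95, 79]
print(prefix_rep(P))  # [1, 2, 3, 3, 4, 6, 7, 6]

# Every permutation of length 4 yields a different prefix representation: 4! = 24 of them.
distinct = {tuple(prefix_rep(p)) for p in permutations(range(1, 5))}
print(len(distinct))  # 24
```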

3.2. KMP failure function 

The KMP-style failure function π of order-preserving matching is well-defined under our prefix
representation:

    π[q] = max{k : μ(P[1..k]) = μ(P[q - k + 1..q]) for 1 ≤ k < q}    if q > 1
    π[q] = 0                                                          if q = 1

Intuitively, π[q] = k means that P[1..k] is the longest proper prefix of P[1..q] whose prefix
representation μ(P[1..k]) is matched with μ(P[q - k + 1..q]), which is the prefix representation of
the suffix of P[1..q] with length k. For example, the failure function of P in Fig. 1 is
π[1..m] = (0, 1, 2, 1, 2, 3, 3, 1). As shown in Fig. 3, π[6] = 3 implies
that the longest prefix of μ(P[1..8]) which is matched with the prefix representation of any proper
suffix of P[1..6] = (33, 42, 73, 57, 63, 87) is μ(P[1..π[6]]) = (1, 2, 3).
The construction algorithm of π will be given in Section 3.4.
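
The definition of the failure function can be checked directly, without any of the machinery of Section 3.4, by trying every candidate length k from the longest down. The sketch below is a brute-force check only and is far slower than the construction algorithm; the names prefix_rep and failure_by_definition are ours, and distinct values are assumed.

```python
from bisect import bisect_left, insort


def prefix_rep(x):
    """Rank of each character among the characters up to and including it."""
    seen, mu = [], []
    for c in x:
        mu.append(bisect_left(seen, c) + 1)
        insort(seen, c)
    return mu


def failure_by_definition(P):
    """pi[q] = max k < q with mu(P[1..k]) == mu(P[q-k+1..q]); pi[1] = 0."""
    m = len(P)
    pi = [0] * (m + 1)  # 1-based; pi[0] is unused
    for q in range(2, m + 1):
        for k in range(q - 1, 0, -1):  # longest candidate first
            # P[:k] is P[1..k]; P[q-k:q] is P[q-k+1..q] in 1-based terms.
            if prefix_rep(P[:k]) == prefix_rep(P[q - k:q]):
                pi[q] = k
                break
    return pi[1:]


P = [33, 42, 73, 57, 63, 87, 95, 79]
print(failure_by_definition(P))  # [0, 1, 2, 1, 2, 3, 3, 1]
```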



3.3. Text search 

The failure function π can accelerate order-preserving matching by filtering mismatched
positions as in the KMP algorithm. Let's assume that μ(P)[1..q] is matched with μ(T[i - q..i - 1])[1..q]
but a mismatch is found between μ(P)[q + 1] and μ(T[i - q..i])[q + 1]. π[q] means that μ(P)[1..π[q]]
is already matched with μ(T[i - π[q]..i - 1])[1..π[q]] and matching can be continued at P[π[q] + 1]
comparing with T[i] from the position i - π[q]. Since P[1..π[q]] is the longest prefix whose order
is matched with the suffix of T[i - q..i - 1], the positions from i - q to i - π[q] - 1 can be skipped
without any comparisons as in the KMP algorithm. Fig. 3 shows how π can filter mismatched
positions. When μ(P)[1..6] is matched with μ(T[1..6]) but μ(P)[7] is different from μ(T[1..7])[7],
we can skip the positions from 1 to 3.

KMP-Order-Matcher describes the order-preserving matching algorithm assuming that
μ(P) and π are efficiently computable. In KMP-Order-Matcher, for each index i of T, q is
maintained as the length of the longest prefix of P where μ(P)[1..q] is matched with μ(T[i - q..i - 1]).
At that time, the order-statistic tree 𝒯 contains all the characters of T[i - q..i - 1]. If the rank of
T[i] in 𝒯 is not matched with that of P[q + 1], q is reduced to π[q] by deleting all the characters
T[i - q..i - π[q] - 1] from 𝒯. If P[q + 1] and T[i] have the same rank, μ(P)[1..q + 1] = μ(T[i - q..i])
and the length of the matched pattern q is increased by 1. When q reaches m, the relative order
of T[i - m + 1..i] matches that of P.



KMP-Order-Matcher(T, P)
 1  n ← |T|, m ← |P|
 2  μ(P) ← Compute-Prefix-Rep(P)
 3  π ← KMP-Compute-Failure-Function(P, μ(P))
 4  𝒯 ← ∅
 5  q ← 0
 6  for i ← 1 to n
 7      OS-Insert(𝒯, T, i)
 8      r ← OS-Rank(𝒯, T[i])
 9      while q > 0 and r ≠ μ(P)[q + 1]
10          OS-Delete(𝒯, T[i - q..i - π[q] - 1])
11          q ← π[q]
12          r ← OS-Rank(𝒯, T[i])
13      q ← q + 1
14      if q = m
15          print "pattern occurs at position" i
16          OS-Delete(𝒯, T[i - q + 1..i - π[q]])
17          q ← π[q]



KMP-Order-Matcher is different from the KMP algorithm of the exact pattern matching
in that the matches are done by order relations instead of characters. For each position i of T,
the prefix representation μ(T[i - q..i])[q + 1] of T[i] is computed using the order-statistic tree 𝒯. If
μ(T[i - q..i])[q + 1] does not match μ(P)[q + 1], q is reduced to π[q] so that P implicitly shifts
right by q - π[q].

Another subtle difference is that we do not check whether r = μ(P)[q + 1] before increasing q
by 1 in line 13 (cf. [14] [T5]) because it should be satisfied automatically. From the condition of the
while loop in line 9, q = 0 or r = μ(P)[q + 1] in line 13, and if q = 0, μ(P)[1] = 1 for any pattern
and it matches any text of length 1.

The time required in KMP-Order-Matcher except the computation of the prefix
representation of P and the construction of the failure function π can be analyzed as follows. Each
OS-Insert and OS-Rank can be done in O(log m) time, while OS-Delete takes O(log m) time per
deleted character. The number of OS-Insert is n, and the number of deletions is at most n,
which makes the total time of deletions O(n log m). In the same way, the number of OS-Rank is
bounded by 2n: n for new characters, and the other n for the computation of rank after reducing q.
Thus the total cost of OS-Rank is also O(n log m). To sum up, the time for KMP-Order-Matcher
can be bounded by O(n log m) except the external functions.

Figure 4: Example of computing failure function

3.4. Construction of KMP failure function

The construction of failure function π can be done similarly to the text match as in the KMP
algorithm where each element π[q] is computed by using the previous values π[1..q - 1].

KMP-Compute-Failure-Function describes the construction algorithm of π. It first tries
to compute π[q] starting from the match of μ(P[1..π[q - 1]]) and μ(P[q - π[q - 1]..q - 1]). If
μ(P[1..π[q - 1] + 1])[π[q - 1] + 1] = μ(P[q - π[q - 1]..q])[π[q - 1] + 1], set π[q] = π[q - 1] + 1.
Otherwise, it tries another match for π[π[q - 1]], and repeats until π[q] is computed.

Fig. 4 shows an example of computing failure functions on P in Fig. 1 in which π[7] is being
computed. Starting from q = π[6] = 3, KMP-Compute-Failure-Function tries to match μ(P[4..8])[4] with
μ(P)[4] but it fails. Then, q is decreased to q = π[3] = 2 and it tries to match μ(P[5..8])[3] with
μ(P)[3] and it succeeds. π[7] is assigned to π[3] + 1, and the next iteration is started with q = π[7].



KMP-Compute-Failure-Function(P, μ(P))
    m ← |P|
    𝒯 ← ∅
    k ← 0
    π[1] ← 0
    for q ← 2 to m
        OS-Insert(𝒯, P, q)
        r ← OS-Rank(𝒯, P[q])
        while k > 0 and r ≠ μ(P)[k + 1]
            OS-Delete(𝒯, P[q - k..q - π[k] - 1])
            k ← π[k]
            r ← OS-Rank(𝒯, P[q])
        k ← k + 1
        π[q] ← k
    return π

The time complexity of KMP-Compute-Failure-Function can be analyzed as in that of
KMP-Order-Matcher by replacing the length of T with the length of P, which results in
O(m log m) time.

3.5. Correctness and time complexity 

The correctness of our matching algorithm comes from the fact that the failure function is well
defined as in the KMP algorithm. From the analysis of Sections 3.3 and 3.4, it is clear that our
algorithm does not miss any matching position.

The total time complexity is O(n log m): O(m log m) for the prefix representation and failure
function computation, and O(n log m) for text search. Compared with the O(n) time of exact pattern
matching, our algorithm has the overhead of an O(log m) factor, which is optimized in the
subsequent section.
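
To make the whole of Section 3 concrete, the following sketch combines the prefix representation, the failure function, and the text search. It is written under several assumptions of ours: the order-statistic tree is replaced by a sorted Python list (so each insertion and deletion costs O(m) instead of O(log m)), all values are distinct, and the window is also trimmed after a full match so that it stays in step with q. The helper names are hypothetical.

```python
from bisect import bisect_left, insort


def rank_in(window, c):
    """1-based rank of c in the sorted list window (c must already be in it)."""
    return bisect_left(window, c) + 1


def remove(window, c):
    del window[bisect_left(window, c)]


def prefix_rep(P):
    seen, mu = [], []
    for c in P:
        insort(seen, c)
        mu.append(rank_in(seen, c))
    return mu


def failure_function(P, mu):
    m = len(P)
    pi = [0] * (m + 1)  # 1-based
    window, k = [], 0   # window holds P[q-k..q-1] before each step
    for q in range(2, m + 1):
        insort(window, P[q - 1])
        r = rank_in(window, P[q - 1])
        while k > 0 and r != mu[k]:           # mu[k] is mu(P)[k+1] in 1-based terms
            for j in range(q - k, q - pi[k]):  # drop P[q-k..q-pi[k]-1]
                remove(window, P[j - 1])
            k = pi[k]
            r = rank_in(window, P[q - 1])
        k += 1
        pi[q] = k
    return pi


def kmp_order_matcher(T, P):
    """Return the 1-based end positions of all order-preserving occurrences of P in T."""
    n, m = len(T), len(P)
    mu = prefix_rep(P)
    pi = failure_function(P, mu)
    window, q, out = [], 0, []
    for i in range(1, n + 1):
        insort(window, T[i - 1])
        r = rank_in(window, T[i - 1])
        while q > 0 and r != mu[q]:
            for j in range(i - q, i - pi[q]):  # drop T[i-q..i-pi[q]-1]
                remove(window, T[j - 1])
            q = pi[q]
            r = rank_in(window, T[i - 1])
        q += 1
        if q == m:
            out.append(i)
            for j in range(i - q + 1, i - pi[q] + 1):  # keep only the last pi[m] characters
                remove(window, T[j - 1])
            q = pi[q]
    return out


T = [11, 15, 33, 21, 24, 50, 29, 36, 73, 85, 63, 69, 78, 88, 44, 62]
P = [33, 42, 73, 57, 63, 87, 95, 79]
print(kmp_order_matcher(T, P))  # [11]
```

Swapping the sorted list for a balanced order-statistic tree would leave the control flow unchanged and recover the stated O(n log m) bound.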

3.6. Remark on the Boyer-Moore approach 

Variants of the Boyer-Moore algorithm [Tj UHl 123] may be designed for order-preserving
matching, in which case the prefix representation should be replaced by the suffix representation to
proceed matching from right to left of the pattern. The good suffix heuristic [7] is well-defined
with the suffix representation, but the bad character heuristic [7] is not applicable since the
character itself has nothing to do with order relations. As the performance of the Boyer-Moore algorithm
is significantly dependent on the bad character heuristic, we cannot expect that the gain of
Boyer-Moore variants for order-preserving matching is comparable to that of the original Boyer-Moore
algorithm for the exact matching. Moreover, some practical algorithms such as the Horspool [18]
and the Sunday algorithms [23] cannot be applied to order-preserving matching because they
employ only the bad character heuristic for filtering mismatched positions.

4. O(n + m log m) algorithm

4.1. Nearest neighbor representation

The text search of the previous algorithm can be optimized further to remove the O(log m) overhead
of computing rank functions. In the text search of the O(n log m) algorithm, the rank of each
character T[i] in T[i - q..i] is computed to check whether it is matched with μ(P)[q + 1]
when we know that μ(P)[1..q] is matched with μ(T[i - q..i - 1]). If we can do this check directly
without computing the rank of T[i], the overhead of the operations on 𝒯 can be removed.

The main idea is to check whether the order of each character in the text matches that of the
corresponding character in the pattern by comparing characters themselves without computing
rank values explicitly. When we need to check if a character x[i] of string x has a specific rank
value r in prefix x_i, we can do it by checking x[j] < x[i] < x[k], where x[j] and x[k] are the characters
with the nearest rank values of r.

The nearest neighbor representation of the order relations can be defined as follows. For string
x, ν_p(x)[1..|x|] and ν_n(x)[1..|x|] are the nearest neighbor representation of x where ν_p(x)[i] is
the index of the largest character of x_{i-1} less than x[i] and ν_n(x)[i] is the index of the smallest
character of x_{i-1} greater than x[i]. Let ν_p(x)[i] = -∞ if there is no character less than x[i] in
x_{i-1} and let ν_n(x)[i] = ∞ if there is no character greater than x[i] in x_{i-1}. Let x[-∞] = -∞ and
x[∞] = ∞.



The advantage of the nearest neighbor representation is that we can check whether each text
character is matched with the corresponding pattern character in constant time without computing
rank explicitly. Fig. 5 shows the nearest neighbor representation of the order relations of P in Fig. 1.
Suppose that μ(P)[1..i - 1] = μ(T[1..i - 1]) for 1 ≤ i ≤ m. If T[ν_p(P)[i]] < T[i] < T[ν_n(P)[i]], then
μ(P[1..i]) = μ(T[1..i]). For example, μ(T[1])[1] must be matched with μ(P)[1] since T[ν_p(P)[1]] <
c < T[ν_n(P)[1]] for any character c, which coincides with the fact that the rank in the text of size
1 is always 1. For the second character, μ(P)[2] = 2 and T[2] should be larger than T[1] to have
μ(T[1..2])[2] = 2, which is represented by ν_p(P)[2] = 1 and ν_n(P)[2] = ∞. In this way, for each
character, we can decide whether the order of T[i] in μ(T[1..i]) is matched with that of P[i] in
μ(P[1..i]) by checking T[ν_p(P)[i]] < T[i] < T[ν_n(P)[i]].

Figure 5: Example of the nearest neighbor representation

Compute-Nearest-Neighbor-Rep describes the construction of the nearest neighbor
representation of the string P where 𝒯 contains the characters of P_{k-1} in each step of the loop. We
assume that OS-Find-Prev-Index(𝒯, c) (and OS-Find-Next-Index(𝒯, c)) returns the index i
of the largest (smallest) character less than (greater than) c, and returns -∞ (∞) if there is no
such character.

Compute-Nearest-Neighbor-Rep(P)
    m ← |P|
    𝒯 ← ∅
    OS-Insert(𝒯, P, 1)
    (ν_p(P)[1], ν_n(P)[1]) ← (-∞, ∞)
    for k ← 2 to m
        OS-Insert(𝒯, P, k)
        ν_p(P)[k] ← OS-Find-Prev-Index(𝒯, P[k])
        ν_n(P)[k] ← OS-Find-Next-Index(𝒯, P[k])
    return (ν_p(P), ν_n(P))

The time complexity of Compute-Nearest-Neighbor-Rep is O(m log m) since it has m
iterations of the loop and there are 3 function calls on the order-statistic tree 𝒯 taking O(log m)
time in each iteration.
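
A minimal sketch of Compute-Nearest-Neighbor-Rep follows. As an assumption for brevity, the order-statistic tree is again replaced by a sorted Python list of (value, index) pairs, so the sketch does not meet the O(m log m) bound in the worst case. The function name nearest_neighbor_rep is ours, indices are 1-based as in the text, values are assumed distinct, and math.inf stands for ∞.

```python
import math
from bisect import bisect_left, insort


def nearest_neighbor_rep(P):
    """Return (nu_p, nu_n) as lists; entry k-1 describes the 1-based position k."""
    seen = []  # sorted (value, 1-based index) pairs of P[1..k-1]
    nu_p, nu_n = [], []
    for k, c in enumerate(P, start=1):
        pos = bisect_left(seen, (c, 0))  # number of earlier characters smaller than c
        nu_p.append(seen[pos - 1][1] if pos > 0 else -math.inf)
        nu_n.append(seen[pos][1] if pos < len(seen) else math.inf)
        insort(seen, (c, k))
    return nu_p, nu_n


P = [33, 42, 73, 57, 63, 87, 95, 79]
nu_p, nu_n = nearest_neighbor_rep(P)
print(nu_p)  # [-inf, 1, 2, 2, 4, 3, 6, 3]
print(nu_n)  # [inf, inf, inf, 3, 3, inf, inf, 6]
```

For instance, the last entries say that P[8] = 79 lies between P[3] = 73 and P[6] = 87 among the first seven characters.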

4.2. Text search

With the nearest neighbor representation of pattern P and the failure function π, we can
simplify text search so that it does not employ 𝒯 at all. For each character T[i], we can check
μ(P)[q + 1] = μ(T[i - q..i])[q + 1] by comparing T[i] with the characters in T[i - q..i] whose indexes
correspond to ν_p(P)[q + 1] and ν_n(P)[q + 1] in P. Specifically, if T[i - q + ν_p(P)[q + 1] - 1] <
T[i] < T[i - q + ν_n(P)[q + 1] - 1], then μ(P)[q + 1] = μ(T[i - q..i])[q + 1] must be satisfied since
the relative order of T[i] in T[i - q..i] is the same as that of P[q + 1] in P[1..q + 1].

For example, let's come back to the text matching example in Fig. 3. When μ(P)[1..6] is
matched with μ(T[1..6]), we can check whether μ(T[1..7])[7] is matched with μ(P)[7] by checking if
T[7 - 6 + ν_p(P)[7] - 1] < T[7] < T[7 - 6 + ν_n(P)[7] - 1], which can be done in constant time. As
T[6] = 50 and T[∞] = ∞ but T[7] = 29, T[7] should have a rank lower than μ(P)[7], thus μ(T[1..7])
cannot be matched with μ(P)[1..7].

KMP-Order-Matcher2 describes the text search algorithm using the nearest neighbor
representation. The algorithm is essentially equivalent to the previous one but simpler since no rank
function has to be calculated explicitly.

KMP-Order-Matcher2(T, P)
    n ← |T|, m ← |P|
    (ν_p(P), ν_n(P)) ← Compute-Nearest-Neighbor-Rep(P)
    π ← KMP-Compute-Failure-Function2(P, ν_p(P), ν_n(P))
    q ← 0
    for i ← 1 to n
        (j1, j2) ← (ν_p(P)[q + 1], ν_n(P)[q + 1])
        while q > 0 and (T[i] < T[i - q + j1 - 1] or T[i] > T[i - q + j2 - 1])
            q ← π[q]
            (j1, j2) ← (ν_p(P)[q + 1], ν_n(P)[q + 1])
        q ← q + 1
        if q = m
            print "pattern occurs at position" i
            q ← π[q]



The time complexity of KMP-Order-Matcher2 except the precomputation of the nearest
neighbor representation and the failure function is O(n) because only one scan of the text is
required in the for loop as in the KMP algorithm.



4.3. Construction of KMP failure function

The construction of the failure function π is an extension of KMP-Compute-Failure-Function
in Section 3.4 where the rank functions on 𝒯 are replaced with comparisons of characters using ν_p(P)
and ν_n(P) as in KMP-Order-Matcher2. KMP-Compute-Failure-Function2 describes the
construction of the KMP failure function from the nearest neighbor representation of pattern P.



KMP-Compute-Failure-Function2(P, ν_p(P), ν_n(P))
    m ← |P|
    k ← 0
    π[1] ← 0
    for q ← 2 to m
        (j1, j2) ← (ν_p(P)[k + 1], ν_n(P)[k + 1])
        while k > 0 and (P[q] < P[q - k + j1 - 1] or P[q] > P[q - k + j2 - 1])
            k ← π[k]
            (j1, j2) ← (ν_p(P)[k + 1], ν_n(P)[k + 1])
        k ← k + 1
        π[q] ← k
    return π



The time complexity of KMP-Compute-Failure-Function2 is O(m) from the linear scan
of the pattern, similarly to KMP-Order-Matcher2.
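
The pieces of Section 4 can be put together as in the sketch below. It is an illustration under our own assumptions: the nearest neighbor representation is built with a sorted list instead of an order-statistic tree, values are distinct, math.inf stands for ∞, and the helper names (nearest_neighbor_rep, extends, failure_function2, kmp_order_matcher2) are hypothetical. The search itself uses only constant-time comparisons per step.

```python
import math
from bisect import bisect_left, insort


def nearest_neighbor_rep(P):
    seen, nu_p, nu_n = [], [], []  # seen: sorted (value, 1-based index) pairs
    for k, c in enumerate(P, start=1):
        pos = bisect_left(seen, (c, 0))
        nu_p.append(seen[pos - 1][1] if pos > 0 else -math.inf)
        nu_n.append(seen[pos][1] if pos < len(seen) else math.inf)
        insort(seen, (c, k))
    return nu_p, nu_n


def extends(S, i, q, nu_p, nu_n):
    """True if S[i] sits in S[i-q..i] the way P[q+1] sits in P[1..q+1] (1-based i)."""
    j1, j2 = nu_p[q], nu_n[q]  # entries for pattern position q + 1
    lo = S[i - q + j1 - 2] if j1 != -math.inf else -math.inf  # S[i-q+j1-1], 1-based
    hi = S[i - q + j2 - 2] if j2 != math.inf else math.inf    # S[i-q+j2-1], 1-based
    return lo < S[i - 1] < hi


def failure_function2(P, nu_p, nu_n):
    m = len(P)
    pi = [0] * (m + 1)  # 1-based
    k = 0
    for q in range(2, m + 1):
        while k > 0 and not extends(P, q, k, nu_p, nu_n):
            k = pi[k]
        k += 1
        pi[q] = k
    return pi


def kmp_order_matcher2(T, P):
    """Return the 1-based end positions of all order-preserving occurrences of P in T."""
    m = len(P)
    nu_p, nu_n = nearest_neighbor_rep(P)
    pi = failure_function2(P, nu_p, nu_n)
    q, out = 0, []
    for i in range(1, len(T) + 1):
        while q > 0 and not extends(T, i, q, nu_p, nu_n):
            q = pi[q]
        q += 1
        if q == m:
            out.append(i)
            q = pi[q]
    return out


T = [11, 15, 33, 21, 24, 50, 29, 36, 73, 85, 63, 69, 78, 88, 44, 62]
P = [33, 42, 73, 57, 63, 87, 95, 79]
print(kmp_order_matcher2(T, P))  # [11]
```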

4.4. Correctness and Time Complexity

The correctness of our optimized algorithm is derived from that of the previous O(n log m)
algorithm since the difference of the text search is only on rank comparison logic and each
comparison result is the same as that of the previous one. The same failure function π is applied and
the order-statistic tree 𝒯 is only used to compute the nearest neighbor representation of P.

The time complexity of the overall algorithm is O(n + m log m): O(m log m) time for the
computation of the nearest neighbor representation of the pattern, O(m) time for the construction
of the failure function π, and O(n) time for text search. O(n + m log m) is almost linear in the text
length n when n is much larger than m, which is a typical case in pattern matching problems.
The only non-linear factor log m comes from the representation of order relations.

4.5. Generalized order-preserving matching

A generalization of order-preserving matching is possible with some practical applications if
we consider only the orders of the last k characters for a given k < m. For example, in the
stock market scenario in the Introduction of finding a period when a share price of a company
dropped consecutively for 10 days and then went up for the next 5 days, it is sufficient to compare
each share price with that of the day before, which corresponds to k = 1. Our solution is easily
applicable to this generalized problem if the order-statistic tree 𝒯 is maintained to keep only the
last k characters of the inserted characters. The time complexity of the O(n log m) algorithm with
prefix representation becomes O(n log k) and that of the O(n + m log m) algorithm with nearest
neighbor representation becomes O(n + m log k) since the number of characters in 𝒯 is bounded by k.
Both time complexities are reduced to O(n) if k is a constant.

4.6. Remark on the alphabet size

We have no restrictions on the numbers in Σ, insofar as a comparison of two numbers can
be done in constant time. In the case of Σ = {1, 2, ..., U}, however, the order-statistic tree
in Compute-Nearest-Neighbor-Rep can be replaced by van Emde Boas tree jH] or y-fast
trie [25] which takes O(U) space and requires O(log log U) time per operation.

5. O(n log m) algorithm for multiple patterns

In this section, we consider a generalization of order-preserving matching for multiple patterns.

Definition 5.1 (Order-preserving matching for multiple patterns). Given a text T[1..n]
and a set of patterns 𝒫 = {P_1, P_2, ..., P_w}, order-preserving matching for multiple patterns is the
problem of finding all positions of T matched with any pattern in 𝒫.

We propose a variant of the Aho-Corasick algorithm [1] for the multiple pattern case whose
time complexity is O(n log m) where m is the sum of the lengths of the patterns.

5.1. Prefix representation of Aho-Corasick automaton

From the prefix representation of the given patterns, an Aho-Corasick automaton can be defined
to match order relations. The Aho-Corasick automaton consists of the following components.

1. Q: a finite set of states where q_0 ∈ Q is the initial state.

2. g : Q × N_m → Q ∪ {fail}: a forward transition function. N_m is the set of integers in [1..m].

3. π : Q → Q: a failure function.

4. d : Q → Z: the length of the prefix represented by each state q.

5. P : Q → 𝒫: a representative pattern of each state q which has the prefix represented by q.
If there is more than one such pattern, we use the pattern with the smallest index.

6. out : Q → 𝒫 ∪ {∅}: the output pattern of each state q. If q does not match any pattern,
out[q] = ∅, otherwise out[q] = P_i for the longest pattern P_i such that the prefix representation
of P_i is matched with that of any suffix of P[q][1..d[q]].

Given the set of patterns, an Aho-Corasick automaton of the prefix representations is
constructed from a trie in which each node represents a prefix of the prefix representation of some
pattern. The nodes of the trie are the states of the automaton and the root is the initial state q_0
representing the empty prefix. Each node q is an accepting state if out[q] ≠ ∅, which means that
q corresponds to the prefix representation of the pattern out[q]. The forward transition function
g is defined so that g[q_i, a] = q_j when q_i corresponds to μ(P_k)[1..d[q_i]] and q_j corresponds to
μ(P_k)[1..d[q_i] + 1] for some pattern P_k where a = μ(P_k)[d[q_i] + 1]. The trie can be constructed in
O(m) time once the prefix representations of the patterns are given.

Fig. 6 shows an example of an Aho-Corasick automaton with three patterns P_1 = (23, 35, 15, 53, 47),
P_2 = (66, 71, 57, 79, 84, 93), P_3 = (43, 51, 62, 73). The automaton is constructed from the prefix
representations μ(P_1) = (1, 2, 1, 4, 4), μ(P_2) = (1, 2, 1, 4, 5, 6) and μ(P_3) = (1, 2, 3, 4) regardless of
the pattern characters. For example, q_5 represents the prefix (1, 2, 1, 4) which matches with μ(P_1)
and μ(P_2) even though P_1[1..4] and P_2[1..4] have different characters.

Figure 6: Example of AC automaton and failure function
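
The trie underlying the automaton can be sketched as follows for the three patterns above. This is a simplified illustration under our own assumptions: states are plain dictionaries, state ids follow insertion order and therefore need not coincide with the labels q_0, q_1, ... of Fig. 6, and the out field only records the pattern that ends exactly at a state (the failure function and the suffix-based outputs are not built here). The names prefix_rep and build_trie are ours.

```python
from bisect import bisect_left, insort


def prefix_rep(x):
    seen, mu = [], []
    for c in x:
        mu.append(bisect_left(seen, c) + 1)
        insort(seen, c)
    return mu


def build_trie(patterns):
    """Trie over prefix representations; returns the list of states (state 0 is the root)."""
    states = [{"d": 0, "rep": None, "next": {}, "out": None}]
    for idx, P in enumerate(patterns):
        q = 0
        for a in prefix_rep(P):
            nxt = states[q]["next"].get(a)
            if nxt is None:
                nxt = len(states)
                # rep is the first (smallest-index) pattern passing through this state
                states.append({"d": states[q]["d"] + 1, "rep": idx, "next": {}, "out": None})
                states[q]["next"][a] = nxt
            q = nxt
        states[q]["out"] = idx
    return states


P1 = [23, 35, 15, 53, 47]
P2 = [66, 71, 57, 79, 84, 93]
P3 = [43, 51, 62, 73]
states = build_trie([P1, P2, P3])
print(len(states))  # 10 states: the root plus 9 distinct non-empty prefixes

# The prefix (1, 2, 1, 4) is shared by P1 and P2, so both reach the same state.
q = 0
for a in (1, 2, 1, 4):
    q = states[q]["next"][a]
print(states[q]["d"], states[q]["rep"])  # 4 0
```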

Compared to the original Aho-Corasick algorithm, we have two additional values d[q] and P[q] 
for each state q. Both of them are recorded to maintain the order-statistic tree per pattern during 
the construction of the failure function tt. The details are described in the following sections. 

5.2. Aho-Corasick failure function 

The failure function π is defined so that π[q_i] = q_j if and only if the prefix represented
by q_j (i.e. μ(P[q_j])[1..d[q_j]]) is the prefix representation of the longest proper suffix of
P[q_i][1..d[q_i]] (i.e. μ(P[q_i][k..d[q_i]]) for some k > 1) that is represented by a state of the
automaton. For example, for q_8 in Fig. 6 with the prefix (1, 2, 1, 4, 5) of μ(P_2),
π[q_8] = q_4 because P_2[3..5] is the longest proper suffix of P_2[1..5] whose prefix representation
(1, 2, 3) is a prefix of the prefix representation of some pattern. Here, P[q_4] = P_3 and
μ(P[q_4])[1..3] = (1, 2, 3), which matches μ(P_2[3..5]).
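
The definition can be checked directly with a brute-force sketch. The Python code below is only an illustration of the definition, not the construction algorithm of the paper: it names each state by the tuple of ranks it represents (a hypothetical convention), tries the proper suffixes from longest to shortest, and assumes distinct pattern values.

```python
def prefix_rep(seq):
    """mu(seq) as a tuple: rank of each element among the elements up to it."""
    return tuple(1 + sum(y < x for y in seq[:i]) for i, x in enumerate(seq))


def failure_by_definition(patterns):
    """Map each trie state (named by its rank tuple) to its failure state."""
    # Every prefix of every mu(P_k) is a state; keep one pattern prefix that reaches it.
    witness = {}
    for p in patterns:
        mu = prefix_rep(p)
        for d in range(len(p) + 1):
            witness.setdefault(mu[:d], p[:d])

    failure = {}
    for state, prefix in witness.items():
        failure[state] = ()                  # default: the root (empty prefix)
        for k in range(1, len(prefix)):      # proper suffixes, longest first
            candidate = prefix_rep(prefix[k:])
            if candidate in witness:
                failure[state] = candidate
                break
    return failure


patterns = [(23, 35, 15, 53, 47), (66, 71, 57, 79, 84, 93), (43, 51, 62, 73)]
failure = failure_by_definition(patterns)
print(failure[(1, 2, 1, 4, 5)])
```

For the state (1, 2, 1, 4, 5), which is q_8 in Fig. 6, the sketch returns (1, 2, 3), the state q_4. This takes time quadratic or worse in the pattern length, which is why the incremental construction of Section 5.4 is used instead.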

5.3. Text search 

A variant of the Aho-Corasick algorithm can be designed for the multiple pattern matching of
order relations as in AC-Order-Matcher-Multiple. Assuming that the prefix representations
of all the patterns and the failure function are available, it scans the text and follows the
Aho-Corasick automaton until there is no matching forward transition. Then, it follows the failure
function until a successful forward transition is found. In the initial state q_0, it never fails to follow
the forward transition because any character can be matched as the first character. Whenever it
reaches one of the accepting states, it outputs the position in the text and the matched pattern.

The order-statistic tree 𝒯 is maintained to compute each rank value adaptively. For every
forward transition, T[i] is inserted into 𝒯, and for every backward transition π[q_i] = q_j, the oldest
d[q_i] - d[q_j] characters are deleted from 𝒯. The rank of T[i] is recomputed after each backward
transition, once 𝒯 has been updated. For example, suppose AC-Order-Matcher-Multiple
reaches state q_3 of Fig. 6 after reading the first three characters of the text (20, 30, 10, 15), so
that 𝒯 contains {20, 30, 10}, the prefix of the text represented by q_3. As there is no forward
transition from q_3 that matches the rank 2 of the next character 15, the state is changed to q_1 by
following the failure transition. The oldest d[q_3] - d[q_1] = 2 characters are deleted from 𝒯 so
that it contains {10} at the next step. The state is then changed to q_2 by following the forward
transition labeled 2 while inserting 15 into 𝒯 (15 has rank 2 in {10, 15}).
AC-Order-Matcher-Multiple(T, 𝒫)

 1  n ← |T|
 2  for i ← 1 to w
 3      μ(P_i) ← Compute-Prefix-Rep(P_i)
 4  (π, out) ← Compute-AC-Failure-Function(𝒫)
 5  𝒯 ← φ
 6  q ← q_0
 7  for i ← 1 to n
 8      OS-Insert(𝒯, T, i)
 9      r ← OS-Rank(𝒯, T[i])
10      while g[q, r] = fail
11          OS-Delete(𝒯, T[i - d[q] .. i - d[π[q]] - 1])
12          q ← π[q]
13          r ← OS-Rank(𝒯, T[i])
14      q ← g[q, r]
15      if out[q] ≠ φ
16          print "pattern" out[q] "occurs at position" i



The time complexity of AC-Order-Matcher-Multiple is O(n log m), excluding the preprocessing
of the patterns, because it performs n insertions into 𝒯 and thus at most n deletions can take place.
Checking g[q, r] in line 10 takes O(log m) time as well. As each operation takes O(log m) time and
there are O(n) operations, the total time is O(n log m).
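
The search loop can be sketched in Python as follows. This is an illustrative sketch under several assumptions, not the paper's implementation: states are named by rank tuples, positions are 0-based, the automaton is built by brute force (only to drive the loop), a sorted list replaces the order-statistic tree (so the O(log m) bounds above do not hold), values are distinct, there is at least one non-empty pattern, and no two patterns share a prefix representation.

```python
from bisect import bisect_right, insort


def prefix_rep(seq):
    return tuple(1 + sum(y < x for y in seq[:i]) for i, x in enumerate(seq))


def build_automaton(patterns):
    """Brute-force preprocessing: returns (states, failure, out)."""
    witness, out = {}, {}
    for idx, p in enumerate(patterns):
        mu = prefix_rep(p)
        for d in range(len(p) + 1):
            witness.setdefault(mu[:d], p[:d])
        out[mu] = idx                         # accepting state of pattern idx
    failure = {}
    for state, prefix in witness.items():
        failure[state] = ()
        for k in range(1, len(prefix)):       # longest proper suffix first
            cand = prefix_rep(prefix[k:])
            if cand in witness:
                failure[state] = cand
                break
    for state in sorted(witness, key=len):    # inherit outputs along failure links
        if state not in out and failure[state] in out:
            out[state] = out[failure[state]]
    return set(witness), failure, out


def search(text, patterns):
    """Return (end position, pattern index) pairs, as in lines 7-16 of the pseudocode."""
    states, failure, out = build_automaton(patterns)
    window = []    # sorted list standing in for the order-statistic tree
    q = ()         # current state; len(q) plays the role of d[q]
    matches = []
    for i, value in enumerate(text):
        insort(window, value)
        r = bisect_right(window, value)
        while q + (r,) not in states:
            # Delete the oldest d[q] - d[failure[q]] text values from the window.
            start = i - len(q)
            for old in text[start:start + len(q) - len(failure[q])]:
                window.remove(old)
            q = failure[q]
            r = bisect_right(window, value)
        q = q + (r,)
        if q in out:
            matches.append((i, out[q]))
    return matches


patterns = [(23, 35, 15, 53, 47), (66, 71, 57, 79, 84, 93), (43, 51, 62, 73)]
print(search([20, 30, 10, 15, 25, 35, 45], patterns))
```

On the text (20, 30, 10, 15, 25, 35, 45), the first four characters reproduce the example above (a failure transition from the state for (1, 2, 1) to the state for (1)), and the substrings (10, 15, 25, 35) and (15, 25, 35, 45) are then reported as occurrences of P_3.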



5.4. Construction of Aho-Corasick failure function

Compute-AC-Failure-Function shows the construction algorithm of the Aho-Corasick failure
function. As in the original Aho-Corasick algorithm, it computes the failure function in the
breadth-first order of the automaton.

The main difference from the original Aho-Corasick algorithm is that we maintain multiple
order-statistic trees simultaneously (one per pattern) because the rank value of a character depends
on the pattern in which the rank is calculated. Let 𝒯(P_i) denote the order-statistic tree for the
pattern P_i, and let us assume that a representative pattern P[q] is recorded for each node q such
that q is reachable by some prefix of the prefix representation of P[q].

We maintain each order-statistic tree 𝒯(P[q]) of P[q] so that it contains the characters of the
longest proper suffix of P[q][1..d[q]] whose prefix representation is a prefix of the prefix
representation of some pattern. Consider a forward transition g[q_i, a] = q_j such that π[q_i] is available
but π[q_j] is to be computed. If P[q_i] = P[q_j], then 𝒯(P[q_i]) = 𝒯(P[q_j]) and 𝒯(P[q_j]) already contains
the characters of that suffix of P[q_j][1..d[q_i]]. It can be updated by inserting P[q_j][d[q_j]] and
deleting some characters from 𝒯(P[q_j]). However, if P[q_i] ≠ P[q_j], we should first initialize
𝒯(P[q_j]) by inserting the last d[π[q_i]] characters of P[q_j][1..d[q_j] - 1] so that it has the same
number of characters as 𝒯(P[q_i]). 𝒯(P[q_j]) can then be updated as in the other case. In both cases,
the rank of P[q_j][d[q_j]] in 𝒯(P[q_j]) is computed again after each deletion to find the correct
forward transition, starting from π[q_i].

For instance, consider node q_5 in Fig. 6. P[q_5] = P_1 and 𝒯(P_1) has {15, 53} since d[π[q_5]] = 2.
When π[q_7] is computed, it inserts 47 into 𝒯(P_1), where 47 has rank 2 in {15, 53, 47}, and tries to
follow the rank 2 from π[q_5] = q_2. As there is no forward transition of q_2 with label 2, it follows
the failure function π[q_2] = q_1 and deletes 15 from 𝒯(P_1). Similarly, as there is no forward transition
from q_1 for the rank 1 of 47 in {53, 47}, it deletes 53 and reaches q_0. Finally, it follows the forward
transition of q_0 by the rank 1 of 47 in {47}, and π[q_7] = q_1. On the other hand, when π[q_8] is computed,
P[q_8] = P_2 and P[q_8] ≠ P[q_5]. The last d[π[q_5]] characters of P_2[1..d[q_5]] are inserted into 𝒯(P_2),
and 𝒯(P_2) becomes {57, 79}. Then, the next character 84 of P[q_8] is inserted into 𝒯(P_2), where it has
rank 3 in {57, 79, 84}, and it follows the rank 3 from q_2, which results in π[q_8] = q_4.
Compute-AC-Failure-Function(𝒫)

 1  π[q_0] ← q_0
 2  for each P_i ∈ 𝒫
 3      𝒯(P_i) ← φ
 4      out[q_i] ← P_i for the last state q_i of P_i
 5  for each q_i ∈ Q (BFS order)
 6      for each a such that g[q_i, a] ≠ fail
 7          q_j ← g[q_i, a], c ← P[q_j][d[q_j]]
 8          if P[q_j] ≠ P[q_i]
 9              for k ← 1 to d[π[q_i]]
10                  OS-Insert(𝒯(P[q_j]), P[q_j], d[q_i] - d[π[q_i]] + k)
11          OS-Insert(𝒯(P[q_j]), P[q_j], d[q_j])
12          r ← OS-Rank(𝒯(P[q_j]), c)
13          q_h ← π[q_i]
14          while g[q_h, r] = fail
15              OS-Delete(𝒯(P[q_j]), P[q_j][d[q_j] - d[q_h] .. d[q_j] - d[π[q_h]] - 1])
16              r ← OS-Rank(𝒯(P[q_j]), c)
17              q_h ← π[q_h]
18          π[q_j] ← g[q_h, r]
19          if out[q_j] = φ
20              out[q_j] ← out[π[q_j]]
21  return (π, out)



The time complexity of Compute-AC-Failure-Function can be analyzed as follows. The
number of all the forward transitions is at most m, and there are at most m insert operations on
the order-statistic trees because each character of a pattern can be inserted either in line 10 or
in line 11 but not in both. The number of deleted characters cannot exceed the number of inserted
characters, and the number of rank computations is also bounded by m. As the number of each operation is
O(m) and each takes O(log m) time, the total time complexity is O(m log m).
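
The incremental construction with one window per pattern can be sketched in Python as follows. This is a hedged illustration rather than the paper's implementation: sorted lists replace the order-statistic trees (so the O(m log m) bound does not hold), states are numbered in insertion order rather than as in Fig. 6, the states at depth 1 are assigned the root as failure state explicitly (an assumption of this sketch), the out function is omitted, and patterns are assumed to be non-empty with distinct values.

```python
from bisect import bisect_right, insort
from collections import deque


def prefix_rep(seq):
    return [1 + sum(y < x for y in seq[:i]) for i, x in enumerate(seq)]


def build_trie(patterns):
    goto, depth, rep = [{}], [0], [None]
    for idx, p in enumerate(patterns):
        q = 0
        for rank in prefix_rep(p):
            if rank not in goto[q]:
                goto.append({})
                depth.append(depth[q] + 1)
                rep.append(idx)                 # representative pattern P[q]
                goto[q][rank] = len(goto) - 1
            q = goto[q][rank]
    return goto, depth, rep


def compute_failure(patterns):
    goto, depth, rep = build_trie(patterns)
    failure = [0] * len(goto)
    trees = [[] for _ in patterns]              # one sorted window per pattern
    queue = deque([0])
    while queue:                                # breadth-first order
        qi = queue.popleft()
        for qj in goto[qi].values():
            queue.append(qj)
            if qi == 0:
                continue                        # depth-1 state: fails to the root
            pat, tree = patterns[rep[qj]], trees[rep[qj]]
            dj = depth[qj]                      # new character is pat[dj - 1]
            if rep[qj] != rep[qi]:
                # First use of this pattern's window: load the suffix kept for failure[qi].
                for x in pat[dj - 1 - depth[failure[qi]]:dj - 1]:
                    insort(tree, x)
            c = pat[dj - 1]
            insort(tree, c)
            r = bisect_right(tree, c)
            qh = failure[qi]
            while r not in goto[qh]:
                # Delete the oldest d[qh] - d[failure[qh]] characters of the window.
                lo = dj - 1 - depth[qh]
                for x in pat[lo:lo + depth[qh] - depth[failure[qh]]]:
                    tree.remove(x)
                qh = failure[qh]
                r = bisect_right(tree, c)
            failure[qj] = goto[qh][r]
    return goto, depth, failure


patterns = [[23, 35, 15, 53, 47], [66, 71, 57, 79, 84, 93], [43, 51, 62, 73]]
goto, depth, failure = compute_failure(patterns)
print(failure)
```

With insertion-order numbering, state 5 represents (1, 2, 1, 4, 4) and state 6 represents (1, 2, 1, 4, 5), so failure[5] = 1 and failure[6] = 8 correspond to π[q_7] = q_1 and π[q_8] = q_4 in the example above.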

5.5. Correctness and Time Complexity 

The correctness of our algorithm can be easily derived from the correctness of the original 
Aho-Corasick algorithm and our version for the single pattern matching. 

The total time complexity is O(n log m) due to O(m log m) for prefix representation and failure
function computation, and O(n log m) for text search. Compared with the O(n log |Σ|) time of exact
pattern matching, where Σ is the alphabet, our algorithm has a comparable time complexity since
|Σ| for numeric strings can be as large as m.

Note that we cannot remove the log m factor from the above time complexity as in the single
pattern case, since O(log m) time has to be spent at each state to find the forward transition to
follow even with the nearest neighbor representation.

6. Conclusion 

We have introduced order-preserving matching and defined prefix representation and nearest 
neighbor representation of order relations of a numeric string. By using these representations, we 
developed an O(n + m log m) algorithm for single pattern matching and an O(n log m) algorithm
for multiple pattern matching. We believe that our work opens a new direction in string matching 
of numeric strings with many practical applications. 